load("assignment3.rdata")
library(survey)
library(dplyr)
library(ggplot2)
library(stringr)
library(broom)

Background Questions:

1.) How are the phenotypes (glucose and BMI) assessed in the NHANES? (1 point)

ANS:
LBXGLU: Fasting Glucose (mg/dL):
Participants aged 12 years and older who were examined in the morning session were tested. Glucose concentration was determined by a hexokinase method. It is an endpoint enzymatic method with a sample blank correction.

Blood specimens were processed, stored and shipped to Fairview Medical Center Laboratory at the University of Minnesota, Minneapolis Minnesota for analysis. Detailed specimen collection and processing instructions are discussed in the NHANES LPM.

_BMXBMI: Body Mass Index (kg/m**2)_

BMI values are calculated for National Health and Nutrition Examination Survey (NHANES) participants using measured height and weight values as follows: weight (kilograms)/height squared (meters)

All survey participants were eligible for the body measurement component. The body measurement data were collected by trained health technicians. The health technician was accompanied by a recorder during each body measurement examination. The respondent’s age, at the time of the screening interview, determined the body measurement examination protocol for survey participants

2.) What does representative mean with respect to NHANES? Why is it a different than say, the

Framingham Heart Study or the Nurses Health Study? (1 point)

ANS:
The NHANES sample for the survey is selected to represent the U.S. population of all ages. To produce reliable statistics, NHANES over-samples persons 60 and older, African Americans, and Hispanics.

While other studies, like Framingham Heart Study or Nurses Health Study are representative of a particular subgroup of population.

4.) How are the environmental exposures measured in the NHANES? Choose a few examples (up to 5) of

different types of biomarkers of environmental exposures from the ExposureList array and query the NHANES website for the assay description of the biomarker (5 points).

  1. Cadmium, Lead, & Total Mercury - Blood Component

Environmental exposures:
- LBXBCD - Cadmium (ug/L)

Description: A cadmium assay is performed to identify cases of cadmium toxicity. Occupational exposure is the most common cause of elevated cadmium levels.

  • LBXBPB - Lead (ug/dL) Description: Lead is a known environmental toxin that has been shown to deleteriously affect the nervous, hematopoietic, endocrine, renal, and reproductive systems. In young children, lead exposure is a particular hazard because children more readily absorb lead than do adults, and children’s developing nervous systems also make them more susceptible to the effects of lead. The primary sources of exposure for children are lead-laden paint chips and dust as a result of deteriorating lead-based paint. The risk for lead exposure is disproportionately higher for children who are poor, non-Hispanic black, living in large metropolitan areas, or living in older housing. Among adults, the most common high exposure sources are occupational. Blood lead levels measured in previous NHANES programs have been the cornerstone of lead exposure surveillance in the U.S. The data have been used to document the burden and dramatic decline of elevated blood lead levels, to promote the reduction of lead use, and to help to redefine national lead poisoning prevention guidelines, standards, and abatement activities.

Target: Samples: Participants aged 1 year and older who do not meet any of the exclusion criteria are eligible.

Data Process: Whole blood specimens are processed, stored, and shipped to the Division of Laboratory Sciences, National Center for Environmental Health, and Centers for Disease Control and Prevention for analysis.

  1. Metals - Urine (UHM_D) Components

Environmental Exposures: - URXUCO - Cobalt, urine (ug/L) - URXUCS - Cesium, urine (ug/L) Description: Trace metals have been associated with adverse health effects in occupational studies or laboratory studies, but have not been monitored in general population groups.

This method is used to achieve rapid and accurate quantifications of multiple elements of toxicological and nutritional interest. The method is sensitive and rapid enough to screen urine specimens from subjects suspected to be exposed to a number of important toxic elements or to evaluate environmental or other nonoccupational exposure to these same elements.

Target Samples: Participants aged 6 years and older who met the subsample requirements.
Data Process:
Urine specimens are processed, stored, and shipped to the Division of Environmental Health Laboratory Sciences, National Center for Environmental Health, Centers for Disease Control and Prevention for analysis

  1. Cotinine - Serum (COT_D) Component

Environmental Exposures: LBXCOT - Cotinine (ng/mL)

Description:
The specific aims of the component are:

  • to measure the prevalence and extent of tobacco use;
  • to estimate the extent of exposure to environmental tobacco smoke (ETS), and determine trends in exposure to ETS;
  • to describe the relationship between tobacco use (as well as exposure to ETS) and chronic health conditions, including respiratory and cardiovascular diseases.

The tobacco component for NHANES will include questionnaire items on current and past use of cigarettes, pipes, cigars and smokeless tobacco. Exposure to ETS at home and at work and in-utero ETS exposure among children will also be obtained. ETS exposure will also be assessed for examinees 3 years of age and older through the measurement of serum cotinine, a metabolite of nicotine. In addition, use of nicotine replacement products (e.g., gum and patch) will be collected using questionnaires.

Target Samples: Participants aged 3 years and older who do not meet any of the exclusion criteria are eligible.

Data Process:
Serum specimens are processed, stored, and shipped to the Division of Environmental Health Laboratory Sciences, National Center for Environmental Health, Centers for Disease Control and Prevention for analysis

5.) Let’s explore the phenotypes, body mass index (BMXBMI) and fasting glucose (LBXGLU) with respect to

demographic characteristics of your population in the training dataset. (36 points total from a.)-i.); 37 points total)

a.) How many individuals are there with BMI values? Draw a histogram of BMI and describe the

mean and median. Is the distribution skewed? (1 point)

BMI <- NHData.train  %>% filter(!is.na(BMXBMI)) 

BMI <- BMI %>% mutate(sex = case_when( male == 1 ~ "male", 
                                         female == 1 ~ "female"),
                        race = case_when(white == 1 ~ "white",
                                         black == 1 ~ "black",
                                         mexican == 1 ~ "mexican",
                                         other_hispanic == 1 ~ "hispanic",
                                         other_eth == 1 ~ "others"))
BMI$race <- factor(BMI$race, levels = c("white","black","mexican","hispanic","others"))

dsn <- svydesign(ids=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T, data=BMI)
svyhist(~ BMXBMI, dsn)

svymean(~ BMXBMI, dsn)
##          mean     SE
## BMXBMI 25.839 0.1205
svyquantile( ~ BMXBMI, dsn, quantiles = 0.5)
##          0.5
## BMXBMI 25.19
dim(BMI)
## [1] 17472   195

ANS: 17472, mean: 25.839 Median: 25.19 The distribution is right skewed

b.) Plot BMI vs age. (1 point)

svyplot( BMXBMI ~ RIDAGEYR, design = dsn)

c.) Plot a boxplot of BMI versus sex. (1 point)

svyboxplot( BMXBMI ~ sex, design = dsn)

d.) Plot a boxplot of BMI versus race/ethnicity. (1 point)

svyboxplot(BMXBMI ~ white + black + mexican + other_hispanic + other_eth, dsn, 
           names=c("White","Black","Mexican", "Other Hispanic", "Others"), 
           xlab = "Race/Ethnicity",
           ylab = "BMI")

e.) Plot BMI versus an indicator of socioecononomic status, the income to poverty ratio

(INDFMPIR). The income-to-poverty ratio is a the household income divided by the household poverty level for a given survey year. Therefore a Income-to-poverty ratio of 1 means the individual has household income equal to the poverty level.

svyplot(BMXBMI ~ INDFMPIR, dsn , xlab = "Income-to-poverty ratio", ylab = "BMI")

f.) Describe your findings from the plots a-e qualitatively. (1 point)

ANS: 1. BMI seems to slightly increase with age in the beginning, and achieve it’s peak between 40 to 50, then the decrease with age.
2. BMI seems no significantly different between sex, and race.
3. BMI slightly increased with income-to-poverty ratio, but may not be significant.

g.) What statistical linear model would you use to test the significance and effect of your findings

in (f)? (2 points)

ANS: Linear regression. The coefficient of the variables will tell the effects, and the p-value will tell the significance.

h.) Using the model you described in (g), estimate the following:

i.) Change in BMI for a 1 year change in age and significance of this association. (3 points)
fit <- svyglm(BMXBMI ~ RIDAGEYR, data = BMI, design = dsn)
fit %>% tidy()
##          term   estimate   std.error statistic      p.value
## 1 (Intercept) 20.6623230 0.144659045 142.83464 1.233947e-41
## 2    RIDAGEYR  0.1455033 0.003440554  42.29064 6.435068e-27

ANS: A 1 year change in age increase the BMI by 0.145, p-value is \(6.4 * 10^{-27}\), which is significant.

ii.) Average BMI in males versus females and significance. (3 points)
fit <- svyglm(BMXBMI ~ sex, data = BMI, design = dsn)
fit %>% tidy()
##          term   estimate std.error statistic      p.value
## 1 (Intercept) 26.0653755 0.1578484 165.12912 2.136003e-43
## 2     sexmale -0.4646725 0.1395459  -3.32989 2.445986e-03

ANS: Average BMI in female is 26.06 and in male is 25.60. p-value is \(2.45 * 10^{-3}\), which is significant

iii.) Average BMI in Non-Hispanic Black versus Whites, Mexican American versus Whites, and Other Hispanic vs. White, and Other versus White (4 points).
fit <- svyglm(BMXBMI ~ race, data = BMI, design = dsn)
fit %>% tidy()
##           term    estimate std.error    statistic      p.value
## 1  (Intercept) 25.87563473 0.1531379 168.96953558 9.399238e-40
## 2    raceblack  0.63332347 0.2267175   2.79344726 9.860121e-03
## 3  racemexican -0.67685827 0.2064778  -3.27811575 3.066543e-03
## 4 racehispanic -0.02531159 0.3276991  -0.07724035 9.390472e-01
## 5   raceothers -1.08260966 0.5238973  -2.06645392 4.929230e-02

ANS: Average BMI and significance:
White: 25.876 (reference) Black: 26.509, p-value = 0.0098 , BMI is significantly different between black and white
Mexian:25.199, p-value = 0.0031 , BMI is significantly different between mexican adnd white
Other Hispanic: 25.85, p-value = 0.939, BMI is not significantly different between other hispanic and white Others: 24.793, p-value = 0.0493, BMI is borderline significantly different between others and white

i.) Repeat 5 (a-h) with fasting glucose (LBXGLU) (18 points).

GLU <- NHData.train  %>% filter(!is.na(LBXGLU)) 

GLU <- GLU %>% mutate(sex = case_when( male == 1 ~ "male", 
                                         female == 1 ~ "female"),
                        race = case_when(white == 1 ~ "white",
                                         black == 1 ~ "black",
                                         mexican == 1 ~ "mexican",
                                         other_hispanic == 1 ~ "hispanic",
                                         other_eth == 1 ~ "others"))
GLU$race <- factor(GLU$race, levels = c("white","black","mexican","hispanic","others"))

dsn <- svydesign(ids=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T, data=GLU)

i.a)

svyhist(~ LBXGLU, dsn)

svymean(~ LBXGLU, dsn)
##          mean     SE
## LBXGLU 100.23 0.5817
svyquantile( ~ LBXGLU, dsn, quantiles = 0.5)
##         0.5
## LBXGLU 94.6
dim(GLU)
## [1] 6476  195

ANS: There are 6476 individuals without NA, mean: 100,23 Median: 94.6 The distribution is more right skewed than BMI.

i.b)

svyplot( LBXGLU ~ RIDAGEYR, design = dsn)

i.c)

svyboxplot(LBXGLU ~ sex, design = dsn)

i.d)

svyboxplot(LBXGLU ~ white + black + mexican + other_hispanic + other_eth, dsn, 
           names=c("White","Black","Mexican", "Other Hispanic", "Others"), 
           xlab = "Race/Ethnicity",
           ylab = "Glucose")

i.e)

svyplot(LBXGLU ~ INDFMPIR, dsn , xlab = "Income-to-poverty ratio", ylab = "Glucose")

i.f) ANS: 1. Glucose do not have apparent linear relationship with age, but as age increases, the variation of glucose levels also increase 2. Through the boxplot it is hard to tell the different between sex, and race of glucose level.
3. No significant linear relationship is observed between glucose and income-to-poverty ratio

i.g) ANS:
The same as bmi, we will use linear regression. The coefficient of the variables will tell the effects, and the p-value will tell the significance.

i.h.i) ANS:

fit <- svyglm(LBXGLU ~ RIDAGEYR , design = dsn)
fit %>% tidy()
##          term   estimate  std.error statistic      p.value
## 1 (Intercept) 84.0609374 1.09386378  76.84772 4.066568e-34
## 2    RIDAGEYR  0.3898802 0.02727663  14.29356 2.166597e-14

ANS: A 1 year change in age increase the glucose by 0.390, p-value is \(2.167 * 10^{-14}\), which is significant.

i.h.ii)

fit <- svyglm(LBXGLU ~ sex , design = dsn)
fit %>% tidy()
##          term  estimate std.error  statistic      p.value
## 1 (Intercept) 97.865578 0.7188393 136.143887 4.719289e-41
## 2     sexmale  4.832581 0.7977256   6.057948 1.566102e-06

ANS: Average glucose in female is 97.866 and in male is 102,699. p-value is \(1.56 * 10^{-6}\), which is significant

i.h.iii.)

fit <- svyglm(LBXGLU ~ race , design = dsn)
fit %>% tidy()
##           term    estimate std.error    statistic      p.value
## 1  (Intercept) 100.0431624 0.6599272 151.59727005 1.412246e-38
## 2    raceblack  -1.4418470 1.2949692  -1.11342190 2.761185e-01
## 3  racemexican   0.6420659 1.0560965   0.60796135 5.486969e-01
## 4 racehispanic   4.3031206 3.0142098   1.42761151 1.657767e-01
## 5   raceothers   0.2109915 3.3379627   0.06320968 9.501022e-01

ANS:
Average Glucose and significance:
White: 100.04 (reference) Black: 98.60, p-value = 0.276 , BMI is not significantly different between black and white
Mexian:100.68, p-value = 0.548 , BMI is not significantly different between mexican adnd white
Other Hispanic: 104.34, p-value = 0.166, BMI is not significantly different between other hispanic and white Others: 100.254, p-value = 0.0493, BMI is not significantly different between others and white

The glucose is not statiscally different between different races

j.) How do the demographic characteristics qualitatively compare with those with Body Mass

Index? (e.g., what demographic characteristics are associated with both BMI and glucose?) (1 point) ANS:
1. Both BMI and glucose increase with age and are both significant. 2. BMI in female is larger than in male, while the glucose is the opposite - glucose in male is higher than in female. Both of them are significantly different. 3. Black and Mexican have significantly higher BMI than White, but the glucose have no significant difference between all groups.

6.) In the previous assignment, you got your hands dirty with binary data of genotypes from a GWAS array. In this question, we will explore how biomarkers of exposures are different that ‘cleaner’ measures of genetic variants.

Specifically, we will explore two biomarkers of exposure and nutrition, including a heavy metal exposure that recently made the headlines in Flint, Michigan: serum Lead (LBXBPB). (13 points total)

BPB <- NHData.train %>% filter(!is.na(LBXBPB))

BPB <- BPB %>% mutate(sex = case_when( male == 1 ~ "male", 
                                         female == 1 ~ "female"),
                        race = case_when(white == 1 ~ "white",
                                         black == 1 ~ "black",
                                         mexican == 1 ~ "mexican",
                                         other_hispanic == 1 ~ "hispanic",
                                         other_eth == 1 ~ "others"))
BPB$race <- factor(BPB$race, levels = c("white","black","mexican","hispanic","others"))

dsn <- svydesign(ids=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T, data=BPB)

a.) Plot a histogram of serum lead (LBXBPB) and qualitatively describe the shape of the

distribution. (1 point)

svyhist(~ LBXBPB, dsn)

svymean(~ LBXBPB, dsn)
##          mean     SE
## LBXBPB 1.9738 0.0293
svyquantile( ~ LBXBPB, dsn, quantiles = 0.5)
##        0.5
## LBXBPB 1.5
dim(BPB)
## [1] 16915   195

ANS: The distribution is highly right skewed, mean serum lead is 1.97, and the median is 1.5

b.) Investigators attempt to make this distribution more “normal” by applying a transformation –

what transformation would you use here (hint: think exponential decay)? Why would you want to make a dependent variable look more “normal” (hint: think interpretation) (2 points)

ANS: Take log transformation can make the data distribution less skewed. Making the data looks more normal with log transform let us able to fit our data into some general linear model, like t-test or regression.

c.) Plot LBXBPB vs age (1 point)

svyplot(log(LBXBPB) ~ RIDAGEYR, dsn, xlab = "Age", ylab = "Lead")

d.) Plot a boxplot of LBXBPB versus sex (1 point)

svyboxplot(log(LBXBPB) ~ sex, dsn, xlab = "Gender", ylab = "Lead")

#### e.) Plot a boxplot of LBXBPB versus race/ethnicity (1 point)

svyboxplot(log(LBXBPB) ~ race, dsn, xlab = "Race", ylab = "Lead")

#### f.) Plot a plot of LBXBPB versus an indicator of socioecononomic status, the income to poverty ratio (INDFMPIR) (1 point)

svyplot(log(LBXBPB) ~ INDFMPIR, dsn, xlab = "Income-to-Poverty Ratio", ylab = "Lead")

g.) Using a linear regression model, estimate the following:

i.) Change in LBXBPB for a 1 year change in age and significance of this association (2

points).

fit <- svyglm(log(LBXBPB) ~ RIDAGEYR, data = BPB, design = dsn)
fit %>% tidy()
##          term    estimate    std.error statistic      p.value
## 1 (Intercept) 0.096896357 0.0295123528  3.283247 2.755865e-03
## 2    RIDAGEYR 0.009295286 0.0005868151 15.840229 1.657708e-15

ANS: A 1 year change in age will increase log(LBXBPB) by 0.0093, which is about 1.01 in lead levels. P-value is \(1.65*10^{-15}\), which is significant.

ii.) Average LBXBPB in males versus females and significance (2 points).
fit <- svyglm(log(LBXBPB) ~ sex, data = BPB, design = dsn)
fit %>% tidy()
##          term  estimate  std.error statistic      p.value
## 1 (Intercept) 0.2432349 0.01464807  16.60526 5.004044e-16
## 2     sexmale 0.3940273 0.01232012  31.98242 1.375464e-23

ANS: Average Lead in female is \(e^{0.243} = 1.275\), in male is \(e^{0.637} = 1.891\), p-value is \(1.37 * 10^{-23}\), which is significantly different

iii.) Average LBXBPB in Non-Hispanic Black) versus Whites, Mexican American versus

Whites, and Other Hispanic vs. White, and Other versus White (2 points).

fit <- svyglm(log(LBXBPB) ~ race, data = BPB, design = dsn)
fit %>% tidy()
##           term    estimate  std.error  statistic      p.value
## 1  (Intercept)  0.41779385 0.01418008 29.4634342 6.199649e-21
## 2    raceblack  0.14836175 0.02021919  7.3376714 1.094140e-07
## 3  racemexican  0.06544387 0.03388176  1.9315367 6.482823e-02
## 4 racehispanic -0.01582878 0.04373909 -0.3618909 7.204745e-01
## 5   raceothers -0.06601124 0.02856402 -2.3109922 2.936507e-02
fit %>% tidy() %>% select(estimate) %>% exp()
##    estimate
## 1 1.5186076
## 2 1.1599324
## 3 1.0676328
## 4 0.9842958
## 5 0.9361203

ANS: Average Lead levels and significance:
White: 1.518 (reference) Black: 1.159, p-value = \(1.09 * 10^{-7}\) , Lead levels is significantly different between black and white
Mexian:1.067, p-value = 0.065 , Lead levels is not significantly different between mexican adnd white
Other Hispanic: 0.984, p-value = 0.720, Lead levels is not significantly different between other hispanic and white Others: 0.936, p-value = 0.0294, Lead levels is significantly different between others and white

7.) How are nutrient factors distributed? Repeat question 5 a, c, d, e with serum vitamin D (LBXVID) (4

points)

7.a)

VitD <- NHData.train  %>% filter(!is.na(LBXVID)) 

VitD <- VitD %>% mutate(sex = case_when( male == 1 ~ "male", 
                                         female == 1 ~ "female"),
                        race = case_when(white == 1 ~ "white",
                                         black == 1 ~ "black",
                                         mexican == 1 ~ "mexican",
                                         other_hispanic == 1 ~ "hispanic",
                                         other_eth == 1 ~ "others"))
VitD$race <- factor(VitD$race, levels = c("white","black","mexican","hispanic","others"))

dsn <- svydesign(ids=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T, data=VitD)
svyhist(~ LBXVID, dsn)

svymean(~ LBXVID, dsn)
##         mean    SE
## LBXVID 23.76 0.385
svyquantile( ~ LBXVID, dsn, quantiles = 0.5)
##        0.5
## LBXVID  23
dim(VitD)
## [1] 7807  195

ANS: There are 7807 individuals with Vitamin D. Mean:23.76, Media:23, The distribution is normally distributed, with mildly right skewed

7.c)

svyboxplot(LBXVID ~ sex, dsn, xlab = "sex", ylab = "Vitamin D")

7.d)

svyboxplot(LBXVID ~ race, dsn, xlab = "race", ylab = "Vitamin D")

7.e)

svyplot(LBXVID ~ INDFMPIR, dsn , xlab = "Income-to-poverty ratio", ylab = "BMI")

8.) Could demographic variables be confounders in a test of association between Lead and BMI or glucose? Mediators? Please justify. What about vitamin D? (4 points)

ANS:

Demographic variables can be confounders between Lead and BMI. For example, sex is associate with Lead level and BMI, therefore, sex can be a counfounder. Mediators has to be on the causative pathway between exposure(Lead) and outcome(BMI), demographic variables are less likely to be influence by lead level.

Vitamin D shows assoication with age, sex and race, so demorgraphic variables can also be counders between VitD and BMI. Demographic variables, again, is unlikely to a mediator. However, if we look at the association between lead and BMI, vitamin D could potentially be a mediator, if there are some biological evidence being proven. For example, lead level could influence the absorbtion of vitamin D, and cause the BMI change.

Executing environment-wide associations (33 points):

1.)

Now we will test each of the exposures in ExposureList for linear association with a quantitative or continuous trait in the training and testing datasets separately. Write R code, from scratch, to execute a EWAS called ewas.R for a given dataset (e.g., training or testing) (code and output: 10 total points). Your script will input either a flag to execute the EWAS on the training or testing datasets and output a .csv file that contains the exposure ID (e.g., LBXBPB),exposure name, phenotype name,estimate,standard error, pvalue,false discovery rate (FDR)

As implied in the output above, your code should execute association that tests the association between each exposure in ExposureList and 1 SD of the phenotype (e.g., BMI) for a 1 standard deviation change in the logarithm base 10 of the exposure value. The dependent variable is the phenotype, and the independent variable is an exposure. Adjust all models by age, race/ethnicity, income/poverty ratio (INDFMPIR). Using your R code, execute the following:

A.) EWAS on 1 standard deviation change of Body Mass Index (BMXBMI) in the training and testing datasets separately. To scale BMXBMI, you can use the scale command in the linear modeling function. Your output should be named bmi_train.csv and bmi_test.csv.

flag <- "test"
if(flag == "train"){
  data <- NHData.train
}else{
  data <- NHData.test
}

data <- data  %>% filter(!is.na(BMXBMI)) 

data <- data %>% mutate(sex = case_when( male == 1 ~ "male", 
                                         female == 1 ~ "female"),
                        race = case_when(white == 1 ~ "white",
                                         black == 1 ~ "black",
                                         mexican == 1 ~ "mexican",
                                         other_hispanic == 1 ~ "hispanic",
                                         other_eth == 1 ~ "others"))

dsn <- svydesign(ids=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T, data= data)
bmi <- data.frame(matrix(vector(), 0, 6))

for(exposure in ExposureList){
  f <- str_glue("scale(BMXBMI) ~ scale(log10({exposure} + 1e-07)) + RIDAGEYR + sex + race + INDFMPIR")
  exp_name <- ExposureDescription %>% filter(var == exposure) %>% select(var_desc) %>% distinct(var_desc) %>% slice(1) %>% pull(var_desc)
  # Check there is match for two data
  if(length(table(data$sex[!is.na(data[,exposure])]))<=1 | length(table(data[,exposure])) <= 1){
    d <- data.frame(exposure, exp_name, "BMXBMI", fit.estimate=NA,  fit.std.error=NA, fit.p.value=NA)
  }else{
    fit <- svyglm(f, data = data, dsn) %>% tidy() 
    target <- str_glue('scale(log10({exposure} + 1e-07))')
    if(target %in% fit$term){
      fit <- fit %>% filter(term == target)
      d <- data.frame(exposure, exp_name, "BMXBMI", fit$estimate, fit$std.error, fit$p.value)
    }else{
      d <- data.frame(exposure, exp_name, "BMXBMI", fit.estimate=NA,  fit.std.error=NA, fit.p.value=NA)
    }
  }
  bmi <- rbind(bmi, d)
}

colnames(bmi) <- c("Exposure ID", "Exposure Name", "Phenotype Name", "Estimate", "Standard Error", "P-value")

bmi$FDR <- p.adjust(bmi$`P-value`, method = "fdr")

output <- str_glue("bmi_{flag}.csv")
write.csv(bmi,output, row.names = FALSE)

B.) EWAS in 1 standard deviation of the logarithm base e of fasting blood glucose (LBXGLU) in

the training and testing datasets separately. To scale fasting glucose, you can use the scale and log command in the linear modeling function. Your output should be named fasting_glucose_train.csv and fasting_glucose_test.csv.

flag <- "test"
if(flag == "train"){
  data <- NHData.train
}else{
  data <- NHData.test
}

data <- data  %>% filter(!is.na(LBXGLU)) 

data <- data %>% mutate(sex = case_when( male == 1 ~ "male", 
                                         female == 1 ~ "female"),
                        race = case_when(white == 1 ~ "white",
                                         black == 1 ~ "black",
                                         mexican == 1 ~ "mexican",
                                         other_hispanic == 1 ~ "hispanic",
                                         other_eth == 1 ~ "others"))

dsn <- svydesign(ids=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T, data= data)
glu <- data.frame(matrix(vector(), 0, 6))

for(exposure in ExposureList){
  f <- str_glue("scale(log(LBXGLU)) ~ scale(log10({exposure} + 1e-07)) + RIDAGEYR + sex + race + INDFMPIR")
  exp_name <- ExposureDescription %>% filter(var == exposure) %>% select(var_desc) %>% distinct(var_desc) %>% slice(1) %>% pull(var_desc)
  # Check there is match for two data
  if(length(table(data$sex[!is.na(data[,exposure])]))<=1 | length(table(data[,exposure])) <= 1){
    d <- data.frame(exposure, exp_name, "LBXGLU", fit.estimate=NA,  fit.std.error=NA, fit.p.value=NA)
  }else{
    fit <- svyglm(f, data = data, dsn) %>% tidy() 
    target <- str_glue('scale(log10({exposure} + 1e-07))')
    if(target %in% fit$term){
      fit <- fit %>% filter(term == target)
      d <- data.frame(exposure, exp_name, "LBXGLU", fit$estimate, fit$std.error, fit$p.value)
    }else{
      d <- data.frame(exposure, exp_name, "LBXGLU", fit.estimate=NA,  fit.std.error=NA, fit.p.value=NA)
    }
  }
  glu <- rbind(glu, d)
}

colnames(glu) <- c("Exposure ID", "Exposure Name", "Phenotype Name", "Estimate", "Standard Error", "P-value")

glu$FDR <- p.adjust(glu$`P-value`, method = "fdr")

output <- str_glue("glu_{flag}.csv")
write.csv(glu,output, row.names = FALSE)

C.) Use the p.adjust function in R to estimate the Benjamini-Hochberg False Discovery Rate.

You can execute your ewas.R script on Orchestra OR on your local computer.

2.) Analysis of EWAS results:

a.) Produce a volcano plot of the results for glucose and body mass index by plotting the association size, or estimate on the x-axis and the -log10(p-value) on the yaxis. What is this plot depicting? Why is it useful to scale the dependent and independent variables before running the model? (1 point)

bmi <- read.csv("bmi_train.csv")
bmi %>% filter(!is.na(Estimate)) %>% mutate(threshold = ifelse(Estimate >= 0.2 | Estimate <= -0.2, "Out", "In")) %>%  ggplot(aes(x=Estimate, y=-log10(FDR))) +
geom_point(aes(col = threshold), size=1.2) +
scale_colour_manual(values = c("Out" = "Red", "In" = "Black"))

glu <- read.csv("glu_train.csv")
glu %>% filter(!is.na(Estimate)) %>% mutate(threshold = ifelse(Estimate >= 0.2 | Estimate <= -0.2, "Out", "In")) %>%  ggplot(aes(x=Estimate, y=-log10(FDR))) +
geom_point(aes(col = threshold), size=1.2) +
scale_colour_manual(values = c("Out" = "Red", "In" = "Black"))

ANS: The vocalno plot shows the significance and effect size at the same time. The higher the point is means it’s more significant, and the outer it is mean it has larger effect size. The reason we scale the variables is because now we can compare all the variables wit different absolute value on the same scale.

b.) What is the pvalue that achieves FDR of 5%? Of 1%? What does FDR at a certain threshold

mean? (2 points)

files <- c("bmi_train.csv","bmi_test.csv","glu_train.csv","glu_test.csv")

for(file in files){
  data <- read.csv(file)
  order_data <- data[order(data$FDR),c("P.value","FDR")]
  print(order_data[max(which(order_data$FDR < 0.05)),])
  print(order_data[max(which(order_data$FDR < 0.01)),])
}
##       P.value        FDR
## 86 0.02112903 0.04724687
##         P.value         FDR
## 125 0.002813355 0.009846743
##        P.value        FDR
## 106 0.02162014 0.04902595
##         P.value         FDR
## 108 0.003080436 0.009184263
##         P.value       FDR
## 107 0.004023426 0.0489001
##          P.value         FDR
## 103 0.0001543363 0.008128381
##        P.value        FDR
## 13 0.002495111 0.04435753
##          P.value         FDR
## 139 0.0004497418 0.008994836

ANS: BMI Train: FDR 5%: 0.0211 1%: 0.0028 BMI Test: FDR 5%: 0.0216 1%: 0.0031 Glucose Train: FDR 5%: 0.0040 1%: 0.00015 Glucose Test: FDR 5%: 0.0025 1%: 0.00045

FDR 5% threshold means that in multiple testing, we want to control the expected propotion of the expected proportion of “discoveries” (rejected null hypotheses) that are false (incorrect rejections) to be 5%, and 1% . Same for FDR 1%.

c.) Write code to filter out “replicated” findings using the following heuristic: FDR significance of 10% in the training dataset, p-value < 0.05 in the testing dataset, and concordant directionality of associations (e.g., both are >0 OR both are less than <0). How many findings were replicated in body mass index? And in fasting glucose? (5 points)

bmi_train <- read.csv("bmi_train.csv")
bmi_test <- read.csv("bmi_test.csv")

bmi_train_list <- bmi_train %>% filter(FDR < 0.1) 
bmi_test_list <- bmi_test %>% filter(P.value < 0.05) 

bmi_list <- bmi_train_list %>% 
              inner_join(bmi_test_list, by = "Exposure.ID")  %>% 
              filter(Estimate.x * Estimate.x > 0)
dim(bmi_list)
## [1] 63 13
glu_train <- read.csv("glu_train.csv")
glu_test <- read.csv("glu_test.csv")

glu_train_list <- glu_train %>% filter(FDR < 0.1) 
glu_test_list <- glu_test %>% filter(P.value < 0.05)

glu_list <- glu_train_list %>% 
              inner_join(glu_test_list, by = "Exposure.ID")  %>% 
              filter(Estimate.x * Estimate.x > 0)

dim(glu_list)
## [1]  9 13

ANS: There are 63 replicated findings in BMI, and 9 replicated findings in glucose.

d.) Interpret analytically the top 3 most findings, ranked by low to high FDR in body mass index and fasting glucose. Specifically, how much does (a) body mass index and (b) fasting glucose change with respect to the top finding? Write down your answer in the units of estimate. (3 points)

bmi_list %>% arrange(FDR.x) %>% head()
##   Exposure.ID            Exposure.Name.x Phenotype.Name.x Estimate.x
## 1      LBXGTC        g-tocopherol(ug/dL)           BMXBMI  0.2276420
## 2      LBXB12 Vitamin B12, serum (pg/mL)           BMXBMI -0.2042356
## 3      LBXFOL      Folate, serum (ng/mL)           BMXBMI -0.1892721
## 4      LBXBPB               Lead (ug/dL)           BMXBMI -0.2016765
## 5      LBXHPE  Heptachlor Epoxide (ng/g)           BMXBMI  0.2688981
## 6      LBXBEC    trans-b-carotene(ug/dL)           BMXBMI -0.2572972
##   Standard.Error.x    P.value.x        FDR.x            Exposure.Name.y
## 1      0.009970789 2.616338e-16 4.212304e-14        g-tocopherol(ug/dL)
## 2      0.010279947 4.284396e-15 2.542922e-13 Vitamin B12, serum (pg/mL)
## 3      0.009574916 4.738364e-15 2.542922e-13      Folate, serum (ng/mL)
## 4      0.015785288 2.272531e-11 9.146939e-10               Lead (ug/dL)
## 5      0.028521789 5.380970e-09 1.732672e-07  Heptachlor Epoxide (ng/g)
## 6      0.008080847 7.791652e-09 2.090760e-07    trans-b-carotene(ug/dL)
##   Phenotype.Name.y Estimate.y Standard.Error.y    P.value.y        FDR.y
## 1           BMXBMI  0.2354186       0.01046289 1.120786e-16 9.022329e-15
## 2           BMXBMI -0.1881993       0.01312390 1.210140e-12 2.164806e-11
## 3           BMXBMI -0.1821714       0.01031701 1.769596e-14 4.748415e-13
## 4           BMXBMI -0.2021204       0.01558564 8.831201e-12 1.421823e-10
## 5           BMXBMI  0.2908614       0.03256665 4.483683e-05 4.010405e-04
## 6           BMXBMI -0.2587954       0.01285841 1.167811e-15 6.267251e-14
glu_list %>% arrange(FDR.x) %>% head()
##   Exposure.ID                    Exposure.Name.x Phenotype.Name.x
## 1      LBXGTC                g-tocopherol(ug/dL)           LBXGLU
## 2      LBXBEC            trans-b-carotene(ug/dL)           LBXGLU
## 3      LBXCBC              cis-b-carotene(ug/dL)           LBXGLU
## 4      LBXCRY             b-cryptoxanthin(ug/dL)           LBXGLU
## 5      LBXLUZ Combined Lutein/zeaxanthin (ug/dL)           LBXGLU
## 6      LBXVID                  Vitamin D (ng/mL)           LBXGLU
##    Estimate.x Standard.Error.x    P.value.x        FDR.x
## 1  0.15445792       0.01577668 2.804497e-09 4.431105e-07
## 2 -0.10802779       0.01999974 1.006906e-03 2.656847e-02
## 3 -0.09134860       0.01751955 1.233767e-03 2.784788e-02
## 4 -0.08873553       0.01933813 2.517857e-03 3.978215e-02
## 5 -0.08408332       0.01914369 3.186797e-03 4.577400e-02
## 6 -0.14307110       0.03371682 3.824597e-03 4.890010e-02
##                      Exposure.Name.y Phenotype.Name.y  Estimate.y
## 1                g-tocopherol(ug/dL)           LBXGLU  0.14217226
## 2            trans-b-carotene(ug/dL)           LBXGLU -0.13970371
## 3              cis-b-carotene(ug/dL)           LBXGLU -0.14444696
## 4             b-cryptoxanthin(ug/dL)           LBXGLU -0.08109641
## 5 Combined Lutein/zeaxanthin (ug/dL)           LBXGLU -0.05561641
## 6                  Vitamin D (ng/mL)           LBXGLU -0.11891051
##   Standard.Error.y    P.value.y        FDR.y
## 1       0.01710819 3.121283e-08 2.497027e-06
## 2       0.01910467 2.537314e-07 1.353234e-05
## 3       0.01661959 1.451483e-08 2.322372e-06
## 4       0.01968215 4.497418e-04 8.994836e-03
## 5       0.01699393 3.479926e-03 5.061711e-02
## 6       0.01736755 7.063406e-07 2.825363e-05

BMI Top 3 findings: 1. LBXGTC g-tocopherol(ug/dL) : BMI increase by 0.2276420 SD, with 1 SD change 2. LBXB12 Vitamin B12, serum (pg/mL): BMI decrease by 0.2042356, with 1 SD change 3. LBXFOL Folate, serum (ng/mL): BMI decrease by 0.1892721, with 1 SD change

Glucose Top3 findings: 1. LBXGTC g-tocopherol(ug/dL): Glucose increase by 0.15445792 SD, with 1 SD change 2. LBXBEC trans-b-carotene(ug/dL): Glucose decrease by 0.10802779 SD, with 1 SD change 3. LBXCBC cis-b-carotene(ug/dL): Gluose decrease by 0.09134860 SD, with 1 SD change

e.) How similar (or dissimilar) are the estimates found in BMI and glucose respectively? Correlate

the estimates (from the training data) by plotting the estimates from body mass index on one axis and glucose on the other (each point is an estimate for a biomarker of exposure). What findings have replicated effects in the same direction? Different directions? What might this imply with respect to these phenotypes? (4 points)

similarity <- data.frame(term = bmi_train$Exposure.ID, bmi_estimate = bmi_train$Estimate, glu_estimate = glu_train$Estimate) %>% 
  mutate(direction = case_when(bmi_estimate * glu_estimate > 0 ~ "Same Direction",bmi_estimate * glu_estimate <0 ~"Different Direction")) %>% filter(!is.na(direction)) %>%
  mutate(replicate = case_when(term %in% bmi_list$Exposure.ID | term %in% glu_list$Exposure.ID ~ "Replicate",
                               TRUE ~ "Not Replicate")) %>% 
  mutate(label = case_when(replicate == "Replicate" & direction == "Same Direction" ~ "Replicate-Same",
                           replicate == "Replicate" & direction == "Different Direction" ~ "Replicate-Different",
                           TRUE ~ "Non-Replicate"))
ggplot(similarity, aes(x = bmi_estimate, y = glu_estimate, col = label)) + geom_point()

similarity %>% filter(label == "Replicate-Same") %>% select(term) %>% print()
##      term
## 1  LBXBPB
## 2  LBXCOT
## 3  URXUBA
## 4  URXUSB
## 5  LBXFOL
## 6  LBXVID
## 7  URXMEP
## 8  URXMZP
## 9  URXMHH
## 10 URXMOH
## 11 LBX074
## 12 LBX105
## 13 LBX118
## 14 LBXD04
## 15 LBXD05
## 16 LBXD07
## 17 LBXF04
## 18 LBXF05
## 19 LBXPCB
## 20 LBX099
## 21 LBX194
## 22 LBX196
## 23 LBD199
## 24 LBXBHC
## 25 LBXPDT
## 26 LBXTNA
## 27 LBXHPE
## 28 LBXDIE
## 29 URXP04
## 30 URXP06
## 31 URXP07
## 32 URXP17
## 33 LBXIRN
## 34 LBXALC
## 35 LBXBEC
## 36 LBXCBC
## 37 LBXCRY
## 38 LBXGTC
## 39 LBXLUZ
similarity %>% filter(label == "Replicate-Different") %>% select(term) %>% print()
##      term
## 1  URXUCD
## 2  URXUCS
## 3  URXUTL
## 4  LBXB12
## 5  LBX156
## 6  LBXHXC
## 7  LBX146
## 8  LBX153
## 9  LBX170
## 10 LBX178
## 11 LBX180
## 12 LBX187
## 13 LBX206
## 14 LBXMIR
## 15 URXP02
## 16 URXP19
## 17 LBXLYC
## 18 LBXRPL
## 19 LBXRST
## 20 LBXVIA
## 21 URX14D
## 22 LBXSPH
## 23 URXETL

ANS: There are 39 findings that have replicated effects in either bmi or glucose and have same direction. LBXBPB,LBXCOT,URXUBA,URXUSB,LBXFOL,LBXVID,URXMEP,URXMZP,URXMHH,URXMOH,LBX074,LBX105,LBX118,LBXD04,LBXD05,LBXD07,LBXF04,LBXF05,LBXPCB,LBX099,LBX194,LBX196,LBD199,LBXBHC,LBXPDT,LBXTNA,LBXHPE,LBXDIE,URXP04,URXP06,URXP07,URXP17,LBXIRN,LBXALC,LBXBEC,LBXCBC,LBXCRY,LBXGTC,LBXLUZ

There are 19 findings that have replicated effects and have different direction. URXUCD,URXUCS,URXUTL,LBXB12,LBX156,LBXHXC,LBX146,LBX153,LBX170,LBX178,LBX180,LBX187,LBX206,LBXMIR,URXP02,URXP19,LBXLYC,LBXRPL,LBXRST,LBXVIA,URX14D,LBXSPH,URXETL

When we try to evaluate the association between glucose and bmi, these exposure could be potential confounders that influence the phenotypes.

3.) How correlated are the exposures that are replicated body mass index and glucose? Take the union of the variables that your replicated and estimate their correlation (using cor) and plot these using a heatmap (try heatmap.2 from the gplots package). How do the correlations compare qualitatively with correlations observed in genetics (Hint: think linkage disequilibrium). What is the implications of correlations such as these when interpreting potential findings? (5 points)

library(gplots)
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
replicate_list <- union(bmi_list$Exposure.ID , glu_list$Exposure.ID)
replicate_exposure <- NHData.train %>% filter(!is.na(BMXBMI) & is.na(LBXGLU)) %>% select(replicate_list)  
replicate_correlation <- cor(as.matrix(replicate_exposure), use = 'pairwise.complete.obs')

heatmap.2(replicate_correlation,cexCol = 0.7, cexRow = 0.7)

ANS: In GWAS, some SNPs are highly correlated due to linkage disequilibrium, and in EWAS, these correlated exposure could probably caused by the sampling of sources(ex: both collected from air pollution, or plastic products).

4.) Unlike in GWAS with static SNPs (variables that do not change as a function of time, behavior, or disease biology), environmental exposures are highly correlated with one another, they are dependent on time. Therefore, exposures could be subject to many biases. Propose factors that could induce selection bias, confounding bias, and reverse causality. (2 points)

ANS:
Selection bias: Selection bias can result when the selection of subjects into a study or their likelihood of being retained in the study leads to a result that is different from what you would have gotten if you had enrolled the entire target population. For example, using questionnaire can probably obtain only some subgroup samples with older age, and the age difference could also influence the outcome like BMI and glucose.

Confounding bias: Other than sex, race, social-econmic factors, other demographic factors that is not controlled can also be confounders, like occupation. Workers in the factory could have higher opportunities exposing to heavy metals. Also, these workers could spend more time on laboring work instead of stay in the office, and have lower BMI. If the assumption hold, the occupation would be a coufounder between heavy metal exposure and BMI.

reverse causality: People who have higher blood sugar may be more aware of their own diet and lifestyle, so they may be lower in saturated fat and cholesterol, higher in vitamin and other nutrients. In this case would be a reverse causality.

5.) In what other contexts and datasets could you use a method like ‘EWAS’ or ‘GWAS’? What are the advantages and disadvantages of the method? (3 points) ANS: When we try to explore the cause or association of particular outcomes, such as pheotypes or disease, we can make use of this techinique to perform multiple testing. Dataset with large number of variables, such as SNPs or environmental exposure are suitable for such analysis. Other dataset, such EMR- with large number of patient clincal phenotypes can be use to perform Phenome-wide association study to find the association of particular genetic mutation and clinical presentation.

The advantages of this method is that it can efficiently scan through large number of variables, and select the targets for further investigation. The disadvantage is that using multiple testing, it can easily create a lot of false positive results, and need to be adjust.

(Bonus) Executing environment-wide associations in all-cause mortality (20 points):

You just successfully executed cross-sectional associations in two continuous phenotypes, fasting glucose and body mass index.

1.) What does “cross-sectional” mean? How does this study design differ from other designs that take temporality into account? (4 points)

ANS: A study with individual-level variables that measures exposure and disease at one point in time. A snapshot of the study population. This study design provides weak evidence of causal assocation between exposure and outcome because we may not be certain that the exposure preceded the disease.

2.) You hypothesize that certain exposures may increase or decrease risk for mortality, the hardest disease endpoint of all. What modeling techniques can you use to estimate the risk for time to death? (2 points)

ANS: Cox Regression Model.

3.) Two columns in the dataset include PERMTH_EXM and MORTSTAT. Specifically, the US CDC ascertains if individuals surveyed in the NHANES have died in years 2006 and beyond. If they have died at the time of followup, the MORTSTAT variable is equal to 1. If they have not died, the MORTSTAT variable is equal to 0. The time at which death is ascertained from the time an individual surveyed is in the variable PERMTH_EXM. For example, if the PERMTH_EXM variable is equal to 10, and the MORTSTAT is equal to 1, that means the individual died 10 months after the survey when follow. If the MORTSTAT is equal to 1, the individual was alive at the time of follow-up. Use this information to estimate whether any of your top findings in fasting glucose and body mass index is associated with time to death, even after accounting for individual age, sex, race/ethnicity, and income to poverty ratio. Specifically, estimate the risk for death as a hazard ratio for the top finding from your BMI and glucose EWAS. (14 points)

library("survival")
library("survminer")
## Loading required package: ggpubr
## Loading required package: magrittr
survive_data <- NHData.train %>% filter(!is.na(PERMTH_EXM) & !is.na(MORTSTAT)) %>%
  mutate(sex = case_when( male == 1 ~ "male", 
                          female == 1 ~ "female"),
          race = case_when(white == 1 ~ "white",
                           black == 1 ~ "black",
                           mexican == 1 ~ "mexican",
                           other_hispanic == 1 ~ "hispanic",
                           other_eth == 1 ~ "others"))
survive_data$race <- factor(survive_data$race, levels = c("white","black","mexican","hispanic","others"))

dsn <- svydesign(ids=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T, data=survive_data)


for(top_exposure in c("LBXGTC","LBXB12","LBXFOL","LBXGTC","LBXBEC","LBXCBC")){
  f <- str_glue("Surv(PERMTH_EXM, MORTSTAT) ~ scale({top_exposure})+ RIDAGEYR + sex + race + INDFMPIR")
  res_cox <- svycoxph(as.formula(f) , data = survive_data, design = dsn)
  print(res_cox %>% tidy())
}
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (57) clusters.
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = survive_data)
##            term   estimate   std.error  statistic      p.value    conf.low
## 1 scale(LBXGTC)  0.1148822 0.043461069  2.6433350 8.209376e-03  0.02970004
## 2      RIDAGEYR  0.0843721 0.004143123 20.3643708 0.000000e+00  0.07625173
## 3       sexmale  0.6449254 0.108042939  5.9691580 2.384811e-09  0.43316511
## 4     raceblack  0.3086777 0.129467339  2.3842129 1.711570e-02  0.05492638
## 5   racemexican -0.1510516 0.263271828 -0.5737477 5.661385e-01 -0.66705491
## 6  racehispanic -0.1991432 0.188227720 -1.0579906 2.900597e-01 -0.56806270
## 7    raceothers -0.4408497 0.307442570 -1.4339255 1.515935e-01 -1.04342611
## 8      INDFMPIR -0.2358325 0.038074102 -6.1940405 5.864114e-10 -0.31045640
##     conf.high
## 1  0.20006429
## 2  0.09249247
## 3  0.85668565
## 4  0.56242902
## 5  0.36495169
## 6  0.16977640
## 7  0.16172662
## 8 -0.16120866
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (57) clusters.
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = survive_data)
##            term    estimate   std.error  statistic      p.value
## 1 scale(LBXB12)  0.02023399 0.022791024  0.8878052 3.746456e-01
## 2      RIDAGEYR  0.08272030 0.004158308 19.8927795 0.000000e+00
## 3       sexmale  0.62441878 0.111920688  5.5791185 2.417405e-08
## 4     raceblack  0.29612307 0.124469912  2.3790735 1.735621e-02
## 5   racemexican -0.14344298 0.218205240 -0.6573764 5.109389e-01
## 6  racehispanic -0.21822865 0.216585810 -1.0075852 3.136537e-01
## 7    raceothers -0.58585543 0.315841491 -1.8549033 6.361003e-02
## 8      INDFMPIR -0.24476425 0.034186629 -7.1596486 8.087975e-13
##      conf.low   conf.high
## 1 -0.02443560  0.06490357
## 2  0.07457017  0.09087044
## 3  0.40505826  0.84377929
## 4  0.05216653  0.54007962
## 5 -0.57111739  0.28423144
## 6 -0.64272904  0.20627174
## 7 -1.20489338  0.03318251
## 8 -0.31176881 -0.17775969
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (57) clusters.
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = survive_data)
##            term    estimate   std.error  statistic      p.value
## 1 scale(LBXFOL)  0.01874153 0.040884849  0.4583980 6.466665e-01
## 2      RIDAGEYR  0.08240987 0.004129067 19.9584710 0.000000e+00
## 3       sexmale  0.62979477 0.118216397  5.3274739 9.958806e-08
## 4     raceblack  0.30525581 0.123250144  2.4767177 1.325967e-02
## 5   racemexican -0.13545772 0.216902410 -0.6245100 5.322927e-01
## 6  racehispanic -0.21436894 0.218017710 -0.9832639 3.254776e-01
## 7    raceothers -0.60103379 0.313826645 -1.9151777 5.546985e-02
## 8      INDFMPIR -0.24528664 0.033916288 -7.2321193 4.755085e-13
##      conf.low   conf.high
## 1 -0.06139130  0.09887436
## 2  0.07431705  0.09050270
## 3  0.39809489  0.86149465
## 4  0.06368997  0.54682166
## 5 -0.56057863  0.28966319
## 6 -0.64167580  0.21293792
## 7 -1.21612271  0.01405513
## 8 -0.31176135 -0.17881194
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (57) clusters.
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = survive_data)
##            term   estimate   std.error  statistic      p.value    conf.low
## 1 scale(LBXGTC)  0.1148822 0.043461069  2.6433350 8.209376e-03  0.02970004
## 2      RIDAGEYR  0.0843721 0.004143123 20.3643708 0.000000e+00  0.07625173
## 3       sexmale  0.6449254 0.108042939  5.9691580 2.384811e-09  0.43316511
## 4     raceblack  0.3086777 0.129467339  2.3842129 1.711570e-02  0.05492638
## 5   racemexican -0.1510516 0.263271828 -0.5737477 5.661385e-01 -0.66705491
## 6  racehispanic -0.1991432 0.188227720 -1.0579906 2.900597e-01 -0.56806270
## 7    raceothers -0.4408497 0.307442570 -1.4339255 1.515935e-01 -1.04342611
## 8      INDFMPIR -0.2358325 0.038074102 -6.1940405 5.864114e-10 -0.31045640
##     conf.high
## 1  0.20006429
## 2  0.09249247
## 3  0.85668565
## 4  0.56242902
## 5  0.36495169
## 6  0.16977640
## 7  0.16172662
## 8 -0.16120866
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (57) clusters.
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = survive_data)
##            term    estimate   std.error  statistic      p.value
## 1 scale(LBXBEC) -0.24102727 0.059714332 -4.0363387 5.429182e-05
## 2      RIDAGEYR  0.08071093 0.004828037 16.7171307 0.000000e+00
## 3       sexmale  0.63434539 0.138236957  4.5888263 4.457451e-06
## 4     raceblack  0.35099104 0.201563601  1.7413414 8.162375e-02
## 5   racemexican -0.39625303 0.308280259 -1.2853662 1.986643e-01
## 6  racehispanic -0.31042201 0.314730936 -0.9863092 3.239814e-01
## 7    raceothers -1.63268340 0.833080340 -1.9598151 5.001741e-02
## 8      INDFMPIR -0.21854661 0.054030193 -4.0448977 5.234597e-05
##      conf.low     conf.high
## 1 -0.35806521 -0.1239893290
## 2  0.07124815  0.0901737060
## 3  0.36340593  0.9052848462
## 4 -0.04406636  0.7460484435
## 5 -1.00047123  0.2079651754
## 6 -0.92728331  0.3064392898
## 7 -3.26549087  0.0001240587
## 8 -0.32444384 -0.1126493731
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (57) clusters.
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
##     nest = T, data = survive_data)
##            term    estimate   std.error  statistic      p.value
## 1 scale(LBXCBC) -0.15525258 0.057703587 -2.6905188 7.134103e-03
## 2      RIDAGEYR  0.07966211 0.004862256 16.3837753 0.000000e+00
## 3       sexmale  0.64734559 0.135824793  4.7660341 1.878875e-06
## 4     raceblack  0.36339696 0.199164759  1.8246047 6.806070e-02
## 5   racemexican -0.40284446 0.306778973 -1.3131423 1.891350e-01
## 6  racehispanic -0.31545492 0.316828210 -0.9956655 3.194127e-01
## 7    raceothers -1.63145503 0.833753194 -1.9567602 5.037567e-02
## 8      INDFMPIR -0.22610876 0.055369671 -4.0836211 4.433931e-05
##      conf.low    conf.high
## 1 -0.26834953 -0.042155630
## 2  0.07013226  0.089191955
## 3  0.38113389  0.913557296
## 4 -0.02695880  0.753752713
## 5 -1.00412020  0.198431278
## 6 -0.93642680  0.305516958
## 7 -3.26558126  0.002671203
## 8 -0.33463132 -0.117586196

BMI Top 3 findings: 1. LBXGTC g-tocopherol(ug/dL) : Hazard ratio change by a factor of exp(0.1148822) = 1.121, with 1 SD change of LBXBHC, p-value = 0.0082, which mean LBXGTC significantly increase hazard ratio 2. LBXB12 Vitamin B12(pg/mL): Hazard ratio change by a factor of exp(0.02023399) = 1.021, with 1 SD change of LBX099, p-value = 0.37, which is not significant 3. LBXFOL Folate(ng/mL): Hazard ratio change by a factor of exp(-0.0242372010) = 1.019, with 1 SD change of LBX074, p-value = 0.65, which is not significant

Glucose Top3 findings: 1. LBXGTC g-tocopherol(ug/dL) : Hazard ratio change by a factor of exp(0.1148822) = 1.121, with 1 SD change of LBXBHC, p-value = 0.0082, which mean LBXGTC significantly increase hazard ratio 2. LBXBEC trans-Beta carotene (ug/dL): Hazard ratio change by a factor of exp(-0.24102727) = 0.786, with 1 SD change of LBXPFDO, p-value = \(5.4*10^{-5}\), meaning LBXBEC significantly reduce the hazard ratio. 3. LBXCBC cis-Beta carotene (ug/dL):Hazard ratio change by exp(-0.15525258) = 0.856, with 1 SD change of LBX074, p-value = 0.0071, meaning LBXCBC significantly reduce the hazard ratio